Robust text line detection in historical documents: learning and evaluation methods

Authors

Abstract

Text line segmentation is one of the key steps in historical document understanding. It is challenging due to the variety of fonts, contents, writing styles and the quality of documents that have degraded through the years. In this paper, we address the limitations that currently prevent people from building line segmentation models with a high generalization capacity. We present a study conducted using three state-of-the-art systems, Doc-UFCN, dhSegment and ARU-Net, and show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages. This paper also highlights the importance of the annotations used during training: each existing dataset is annotated differently. We present a unification of the annotations and show its positive impact on the final text recognition results. Finally, we present a complete evaluation strategy using standard pixel-level metrics, object-level ones, and newly introduced goal-oriented metrics.
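
The abstract refers to standard pixel-level metrics without naming them here. As a minimal sketch, assuming these include the intersection-over-union (IoU), precision, recall and F1 scores commonly used for segmentation, the comparison of a predicted line mask against an annotated one could look as follows. The function name `pixel_metrics` and the toy masks are hypothetical and are not taken from the paper.

```python
def pixel_metrics(pred, truth):
    """Compute pixel-level IoU, precision, recall and F1 for two binary masks.

    Both masks are lists of rows holding 0 (background) or 1 (text line).
    This is an illustrative sketch, not the paper's evaluation code.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1  # pixel correctly predicted as text line
            elif p and not t:
                fp += 1  # predicted as text line, annotated as background
            elif t and not p:
                fn += 1  # annotated text-line pixel that was missed

    def ratio(num, den):
        # Return 0.0 when the denominator is zero (e.g. empty masks).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "iou": ratio(tp, tp + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Toy 3x4 masks: the prediction misses the last column of the annotated line.
predicted = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
annotated = [
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
scores = pixel_metrics(predicted, annotated)
print(scores)
```

Scores of this kind only measure pixel overlap. They do not tell whether individual lines were merged or split, which is presumably why the abstract pairs them with object-level and goal-oriented metrics.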

Similar resources

Text line detection in handwritten documents

Article history: Received 13 April 2007; received in revised form 26 March 2008

Robust Line Detection in Historical Church Registers

To automatically acquire information recorded in church registers and other historical scriptures, the text of such documents needs to be segmented prior to automatic reading. Segmentation of old handwritten scriptures is difficult for two main reasons: lines of text are generally not straight, and the ascenders and descenders of adjacent lines interfere. The algorithms described in ...

A Two-Stage Method for Text Line Detection in Historical Documents

This work presents a two-stage text line detection method for historical documents. In the first stage, a deep neural network called ARU-Net labels each pixel as belonging to one of three classes: baseline, separator or other. The separator class marks the beginning and end of each text line. The ARU-Net is trainable from scratch with manageably few manually annotated example images (fewer than 50). This...

Skew detection and text line position determination in digitized documents

This paper proposes a computationally efficient procedure for skew detection and text line position determination in digitized documents, which is based on the cross-correlation between the pixels of vertical lines in a document. The determination of the skew angle in documents is essential in optical character recognition systems. Due to the text skew, each horizontal text line intersects a p...

Text Extraction from Historical Handwritten Documents by Edge Detection

Many national archives and libraries keep large amounts of historical handwritten documents. One problem that many archivists face is the seeping of ink through the pages of certain double-sided handwritten documents after long periods of storage. The result is that the handwritten characters from the reverse side appear as noise on the front side and even interfere with the front side char...

Journal

Journal title: International Journal on Document Analysis and Recognition

Year: 2022

ISSN: 1433-2833, 1433-2825

DOI: https://doi.org/10.1007/s10032-022-00395-7